Efficient Specific-to-General Rule Induction
Author
Abstract
RISE (Domingos 1995; in press) is a rule induction algorithm that proceeds by gradually generalizing rules, starting with one rule per example. This has several advantages compared to the more common strategy of gradually specializing initially null rules, and has been shown to lead to significant accuracy gains over algorithms like C4.5RULES and CN2 in a large number of application domains. However, RISE’s running time (like that of other rule induction algorithms) is quadratic in the number of examples, making it unsuitable for processing very large databases. This paper introduces a method for reducing RISE’s running time based on partitioning the training set, evaluating rules from one partition on examples from another, and combining the final results at classification time. Partitioning guarantees a learning time that is linear in the number of examples, even in the presence of numeric attributes and high noise. Windowing, a well-known speedup method, is also studied as applied to RISE. In low-noise conditions, both methods are successful in reducing running time whilst maintaining accuracy (partitioning sometimes improves it significantly). In noisy conditions, the performance of windowing deteriorates, while that of partitioning remains stable.
Introduction and Previous Work
Rule induction is one of the major technologies underlying data mining. Given a number of classified examples represented by vectors of symbolic and/or numeric attributes, algorithms like C4.5RULES (Quinlan 1993) and CN2 (Clark & Niblett 1989) produce sets of “if ... then ...” rules that allow us to predict the classes of new examples by performing tests on their attributes. However, the running time of these algorithms is typically quadratic or worse in the number of examples, making it difficult to apply them to the very large databases that are now common in many fields.
In C4.5RULES, noise can lead to a cubic running time (Cohen 1995). While some faster variants of rule induction have been proposed (Fürnkranz & Widmer 1994; Cohen 1995), none achieve the ideal goal of linear time. An alternative approach, and the one that is followed in this paper, is to employ some form of sampling, like windowing (Catlett 1991; Quinlan 1993), peepholing (Catlett 1991), or partitioning (Chan & Stolfo 1995). While often (though not always) reducing running time, sampling techniques can sometimes substantially reduce accuracy, and there may be a trade-off between the two. A reduced-accuracy rule set is preferable to a more accurate one that is never reached due to lack of time; ideally, however, the loss in accuracy should be as small as possible, given the available time.
RISE (Domingos 1995; in press) is a rule induction algorithm that searches for rules in a specific-to-general direction, instead of the general-to-specific one used by most rule learners. This has several advantages, among them the ability to detect with confidence a higher level of detail in the databases, and a reduction of sensitivity to the fragmentation (Pagallo & Haussler 1990) and small disjuncts problems (Holte, Acker, & Porter 1989). In a study comparing RISE with several induction algorithms (including C4.5RULES and CN2) on 30 databases from the UCI repository (Murphy & Aha 1995), RISE was found to be more accurate than each of the other algorithms in about two-thirds of the databases, in each case with a confidence of 98% or better according to a Wilcoxon signed-ranks test (DeGroot 1986). RISE also had the highest average accuracy and highest rank. RISE’s running time, like that of previous algorithms, is quadratic in the number of examples, and thus the question arises of whether it is possible to reduce this time to linear without compromising accuracy.
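The linear-time argument for partitioning can be illustrated with a minimal sketch. It assumes fixed-size partitions, a hypothetical stand-in learner (nearest stored example, not RISE itself), and unweighted majority voting at classification time; the text above specifies none of these details, and the evaluation of rules from one partition on examples from another is omitted. With the partition size s held constant, a learner that is quadratic within each partition needs roughly (n/s) * s^2 = n * s comparisons, which grows linearly in n.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Attributes = Tuple[str, ...]
Example = Tuple[Attributes, str]          # (attribute values, class label)
Classifier = Callable[[Attributes], str]


def split_into_partitions(examples: Sequence[Example], size: int) -> List[List[Example]]:
    """Cut the training set into consecutive partitions of at most `size` examples."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [list(examples[i:i + size]) for i in range(0, len(examples), size)]


def learn_nearest_example(partition: Sequence[Example]) -> Classifier:
    """Hypothetical stand-in learner (NOT RISE): predict the class of the
    stored example that differs from the query in the fewest attributes."""
    stored = list(partition)

    def classify(attributes: Attributes) -> str:
        def mismatches(example: Example) -> int:
            return sum(a != b for a, b in zip(example[0], attributes))
        return min(stored, key=mismatches)[1]

    return classify


def learn_partitioned(examples: Sequence[Example], size: int,
                      learner: Callable[[Sequence[Example]], Classifier] = learn_nearest_example
                      ) -> List[Classifier]:
    """Learn one classifier per partition, independently of the others."""
    return [learner(part) for part in split_into_partitions(examples, size)]


def classify_by_vote(classifiers: Sequence[Classifier], attributes: Attributes) -> str:
    """Assumed combination scheme: unweighted majority vote over partitions."""
    if not classifiers:
        raise ValueError("no classifiers to vote")
    votes = Counter(c(attributes) for c in classifiers)
    return votes.most_common(1)[0][0]


def pairwise_cost(n: int) -> int:
    """Comparison count of a learner that is quadratic in n examples."""
    return n * n


def partitioned_cost(n: int, size: int) -> int:
    """Total comparison count when the quadratic learner runs per partition."""
    full, rest = divmod(n, size)
    return full * size * size + rest * rest
```

Under these assumptions, doubling the number of examples doubles `partitioned_cost` but quadruples `pairwise_cost`, which is the contrast between linear and quadratic learning time described above.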
This paper proposes, describes and evaluates the application of windowing and partitioning to RISE; in both cases, this raises issues and opportunities that are not present in general-to-specific systems. The next three sections of the paper describe pure RISE, RISE with windowing, and RISE with partitioning. This is followed by an empirical study comparing the three, and a discussion of the results.
From: KDD-96 Proceedings. Copyright © 1996, AAAI (www.aaai.org). All rights reserved.
Similar resources
The Impact of Goal Specificity on Strategy Use and the Acquisition of Problem Structure
Theories of skill acquisition have made radically different predictions about the role of general problem-solving methods in acquiring rules that promote effective transfer to new problems. Under one view, methods that focus on reaching specific goals, such as means-ends analysis, are assumed to provide the basis for efficient knowledge compilation (Anderson, 1987), whereas under an alternative...
The Place of Fault on the Civil Liability of Physicians and its Comparison with the General Rule of Civil Liability in Jurisprudence and Law of Iran
Are all the actions of the physician in relation to the patient to be treated as a single subject, or are they separable, in which case each one is subject to general rules or to some particular rule? In analyzing the functions of the physician during the course of cognitive therapy, applicants who discuss medical professional responsibilities examine medical practices of a ...
A General Rule for the Influence of Physical Damping on the Numerical Stability of Time Integration Analysis
The influence of physical damping on the numerical stability of time integration analysis has been an open question for decades. In this paper, it is shown that, under specific but very general conditions, physical damping can be disregarded when studying the numerical stability. It is also shown that, provided the specific conditions are met, analysis of structural systems involved in extremely hi...
DETERMINATION OF OPTIMAL HEDGING RULE USING FUZZY SET THEORY FOR MULTI-RESERVOIR OPERATION
To deal with severe drought when water supply is insufficient, a hedging rule based on a hedging rule curve is proposed. In general, in discrete hedging rules, the rationing factors change abruptly from one zone to another. Accordingly, this paper is an attempt to improve the conventional hedging rule to control the changes of rationing factors. In this regard, the simulation model has emp...
Genetic Algorithms and Poker Rule Induction
This thesis is intended to be an experiment on Genetic Algorithms (GAs). The general algorithm is applied to a specific dataset of poker hands. Its performance in detecting the rules of poker games is then evaluated under different measures and conditions. Based on the performance of the algorithm, further understanding of the data can be obtained.
Effective Induction of Hairy Roots in Persian Poppy (Papaver bracteatum Lindl.) Using Sonication Method
Background: Papaver bracteatum Lindl., commonly known as Iranian poppy, is an important medicinal plant due to the presence of benzylisoquinoline alkaloids. Objective: To evaluate the hairy root culture as a novel method for thebaine production. Methods: To optimize the hairy root culture of P. bracteatum, five strains of Agrobacterium rhizogenes (ATCC15834, C318, A13, 9453 and A4) were used by...
Publication date: 1996